A Political Shift…

As a younger generation moves into politics¹, the picture of what an elected official looks like shifted with the 2018 elections. Even so, candidates who differ by gender, race, or LGBTQ identity, among other characteristics, still make up a smaller share than their white male counterparts. There are several reasons for this, such as voters’ strong inclination to elect someone with whom they share an identity and the advantage of an elected official who has held the seat for many years.² We wanted to focus on the kinds of candidates running in the Democratic primaries for the House, the Senate, and governor, while asking the question:

“How can different factors about a Democratic candidate allow us to predict whether he or she wins the primary?”

We predict the outcome as a binary label stating whether the candidate in question will lose or advance in the primary. After seeing a surge of candidates with uncommon characteristics making waves in the media on both liberal and conservative platforms, we decided to look at the numbers ourselves to see whether these new candidates are the new norm or exceptions to the rule. We hope to connect with the politically active community and to bring facts to voters on both sides of the spectrum. Diverse candidates, liberal and conservative alike, bring a unique perspective to their office, and the uncommon candidate should know whether the war has been won and their diversity truly helps their campaign, or whether this is just the beginning.

The Data

Introduction to Dataset

To answer our research question, we found a dataset that contains information about the 811 candidates who appeared on the ballot this year in Democratic primaries for Senate, House and governor, not counting races featuring a Democratic incumbent, as of August 7, 2018. The dataset includes many variables about each Democratic candidate. We removed five columns of the original dataset (General.Status, Won.Primary, Race.Type, Partisan.Lean, and Race.Primary.Election.Date) because they were either redundant (General.Status and Won.Primary carried the same information as Primary.Status but had many missing values) or irrelevant (Race.Type was the same for all candidates, and Partisan.Lean was not relevant to our research). We replaced Race.Primary.Election.Date with the election month alone, since full dates were too specific and less relevant. After filtering out irrelevant data and filling in missing data (many endorsement values were missing because they were either not provided or the endorser simply did not weigh in on the race), the prepared dataset consisted of 811 rows and 28 columns. Below is a sample of our final dataset.

| Candidate | State | District | Office.Type | Primary.Status | Primary.Runoff.Status | Primary.. | Race | Veteran. | LGBTQ. | Elected.Official. | Self.Funder. | STEM. | Obama.Alum. | Party.Support. | Emily.Endorsed. | Guns.Sense.Candidate. | Biden.Endorsed. | Warren.Endorsed. | Sanders.Endorsed. | Our.Revolution.Endorsed. | Justice.Dems.Endorsed. | PCCC.Endorsed. | Indivisible.Endorsed. | WFP.Endorsed. | VoteVets.Endorsed. | No.Labels.Support. | Election.Month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Anthony White (Alabama) | AL | Governor of Alabama | Governor | Lost | None | 3.42 | Nonwhite | Yes | No | No | No | No | No | Neutral | Neutral | No | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | 6 |
| Christopher Countryman | AL | Governor of Alabama | Governor | Lost | None | 1.74 | White | No | Yes | No | No | No | No | Neutral | Neutral | No | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | 6 |
| Doug “New Blue” Smith | AL | Governor of Alabama | Governor | Lost | None | 3.27 | White | Yes | No | No | No | No | No | Neutral | Neutral | No | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | 6 |
| James C. Fields | AL | Governor of Alabama | Governor | Lost | None | 8.00 | Nonwhite | Yes | No | Yes | No | No | No | Neutral | Neutral | No | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | 6 |
| Sue Bell Cobb | AL | Governor of Alabama | Governor | Lost | None | 28.98 | White | No | No | Yes | No | No | No | Neutral | Neutral | No | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | 6 |
| Walt Maddox | AL | Governor of Alabama | Governor | Advanced | None | 54.60 | White | No | No | Yes | No | No | No | Neutral | Neutral | Yes | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | 6 |
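
The cleaning steps described above (drop five columns, keep only the election month, fill blank endorsement cells) can be sketched in pandas roughly as follows. This is a minimal illustration, not our exact script: the inline CSV is a made-up three-row stand-in, the `prepare` helper is a hypothetical name, and the `m/d/yy` date format is an assumption about the raw file.

```python
import io
import pandas as pd

# Made-up stand-in for the raw file; the real CSV has 811 rows and more columns.
RAW = """Candidate,Race.Type,Race.Primary.Election.Date,Primary.Status,General.Status,Partisan.Lean,Won.Primary,Emily.Endorsed.
A,Regular,6/5/18,Lost,,-28.9,No,
B,Regular,6/5/18,Advanced,On the Ballot,-28.9,Yes,Yes
C,Regular,8/7/18,Lost,,3.1,No,No
"""

DROPPED = ["General.Status", "Won.Primary", "Race.Type",
           "Partisan.Lean", "Race.Primary.Election.Date"]


def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps in the text."""
    df = raw.copy()
    # Keep only the month of the primary date (date format is assumed).
    dates = pd.to_datetime(df["Race.Primary.Election.Date"], format="%m/%d/%y")
    df["Election.Month"] = dates.dt.month
    # Drop the five redundant or irrelevant columns.
    df = df.drop(columns=DROPPED)
    # A blank endorsement cell means the endorser did not weigh in.
    endorsement_cols = [c for c in df.columns if c.endswith("Endorsed.")]
    df[endorsement_cols] = df[endorsement_cols].fillna("Neutral")
    return df


clean = prepare(pd.read_csv(io.StringIO(RAW)))
print(clean)
```

Filling blanks with an explicit "Neutral" label, rather than dropping those rows, keeps all 811 candidates available for modeling.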

Description of Variables

Among the 811 Democratic candidates, 265 advanced in their primary, an advancement rate of 32.7%. As mentioned above, we removed five columns and created one new column, the month of the election. The table below describes the variables included in our dataset.

Column Description
Candidate All candidates who received votes in 2018’s Democratic primary elections for U.S. Senate, U.S. House and governor in which no incumbent ran. Supplied by Ballotpedia.
State The state in which the candidate ran. Supplied by Ballotpedia.
District The office and, if applicable, congressional district number for which the candidate ran. Supplied by Ballotpedia.
Office Type The office for which the candidate ran. Supplied by Ballotpedia.
Primary Status Whether the candidate lost (“Lost”) the primary or won/advanced to a runoff (“Advanced”). Supplied by Ballotpedia.
Primary Runoff Status “None” if there was no runoff; “On the Ballot” if the candidate advanced to a runoff but it hasn’t been held yet; “Advanced” if the candidate won the runoff; “Lost” if the candidate lost the runoff. Supplied by Ballotpedia.
Primary % The percentage of the vote received by the candidate in his or her primary. In states that hold runoff elections, we looked only at the first round (the regular primary). In states that hold all-party primaries (e.g., California), a candidate’s primary percentage is the percentage of the total Democratic vote they received. Unopposed candidates and candidates nominated by convention (not primary) are given a primary percentage of 100 but were excluded from our analysis involving vote share. Numbers come from official results posted by the secretary of state or local elections authority; if those were unavailable, we used unofficial election results from the New York Times.
Veteran? If the candidate’s website says that he or she served in the armed forces, we put “Yes.” If the website is silent on the subject (or explicitly says he or she didn’t serve), we put “No.” If the field was left blank, no website was available.
LGBTQ? If the candidate’s website says that he or she is LGBTQ (including indirect references like to a same-sex partner), we put “Yes.” If the website is silent on the subject (or explicitly says he or she is straight), we put “No.” If the field was left blank, no website was available.
Elected Official? We used Ballotpedia, VoteSmart and news reports to research whether the candidate had ever held elected office before, at any level. We put “Yes” if the candidate has held elected office before and “No” if not.
Self-Funder? We used Federal Election Commission fundraising data (for federal candidates) and state campaign-finance data (for gubernatorial candidates) to look up how much each candidate had invested in his or her own campaign, through either donations or loans. We put “Yes” if the candidate donated or loaned a cumulative $400,000 or more to his or her own campaign before the primary and “No” for all other candidates.
STEM? If the candidate identifies on his or her website that he or she has a background in the fields of science, technology, engineering or mathematics, we put “Yes.” If not, we put “No.” If the field was left blank, no website was available.
Obama Alum? We put “Yes” if the candidate mentions working for the Obama administration or campaign on his or her website, or if the candidate shows up on this list of Obama administration members and campaign hands running for office. If not, we put “No.”
Dem Party Support? “Yes” if the candidate was placed on the DCCC’s Red to Blue list before the primary, was endorsed by the DSCC before the primary, or if the DSCC/DCCC aired pre-primary ads in support of the candidate. (Note: according to the DGA’s press secretary, the DGA does not get involved in primaries.) “No” if the candidate is running against someone for whom one of the above things is true, or if one of those groups specifically anti-endorsed or spent money to attack the candidate. If those groups simply did not weigh in on the race, we left the cell blank.
Emily Endorsed? “Yes” if the candidate was endorsed by Emily’s List before the primary. “No” if the candidate is running against an Emily-endorsed candidate or if Emily’s List specifically anti-endorsed or spent money to attack the candidate. If Emily’s List simply did not weigh in on the race, we left the cell blank.
Gun Sense Candidate? “Yes” if the candidate received the Gun Sense Candidate Distinction from Moms Demand Action/Everytown for Gun Safety before the primary, according to media reports or the candidate’s website. “No” if the candidate is running against a candidate with the distinction. If Moms Demand Action simply did not weigh in on the race, we left the cell blank.
Biden Endorsed? “Yes” if the candidate was endorsed by Joe Biden before the primary. “No” if the candidate is running against a Biden-endorsed candidate or if Biden specifically anti-endorsed the candidate. If Biden simply did not weigh in on the race, we left the cell blank.
Warren Endorsed? “Yes” if the candidate was endorsed by Elizabeth Warren before the primary. “No” if the candidate is running against a Warren-endorsed candidate or if Warren specifically anti-endorsed the candidate. If Warren simply did not weigh in on the race, we left the cell blank.
Sanders Endorsed? “Yes” if the candidate was endorsed by Bernie Sanders before the primary. “No” if the candidate is running against a Sanders-endorsed candidate or if Sanders specifically anti-endorsed the candidate. If Sanders simply did not weigh in on the race, we left the cell blank.
Our Revolution Endorsed? “Yes” if the candidate was endorsed by Our Revolution before the primary, according to the Our Revolution website. “No” if the candidate is running against an Our Revolution-endorsed candidate or if Our Revolution specifically anti-endorsed or spent money to attack the candidate. If Our Revolution simply did not weigh in on the race, we left the cell blank.
Justice Dems Endorsed? “Yes” if the candidate was endorsed by Justice Democrats before the primary, according to the Justice Democrats website, candidate website or news reports. “No” if the candidate is running against a Justice Democrats-endorsed candidate or if Justice Democrats specifically anti-endorsed or spent money to attack the candidate. If Justice Democrats simply did not weigh in on the race, we left the cell blank.
PCCC Endorsed? “Yes” if the candidate was endorsed by the Progressive Change Campaign Committee before the primary, according to the PCCC website, candidate website or news reports. “No” if the candidate is running against a PCCC-endorsed candidate or if the PCCC specifically anti-endorsed or spent money to attack the candidate. If the PCCC simply did not weigh in on the race, we left the cell blank.
Indivisible Endorsed? “Yes” if the candidate was endorsed by Indivisible before the primary, according to the Indivisible website, candidate website or news reports. “No” if the candidate is running against an Indivisible-endorsed candidate or if Indivisible specifically anti-endorsed or spent money to attack the candidate. If Indivisible simply did not weigh in on the race, we left the cell blank.
WFP Endorsed? “Yes” if the candidate was endorsed by the Working Families Party before the primary, according to the WFP website, candidate website or news reports. “No” if the candidate is running against a WFP-endorsed candidate or if the WFP specifically anti-endorsed or spent money to attack the candidate. If the WFP simply did not weigh in on the race, we left the cell blank.
VoteVets Endorsed? “Yes” if the candidate was endorsed by VoteVets before the primary, according to the VoteVets website, candidate website or news reports. “No” if the candidate is running against a VoteVets-endorsed candidate or if VoteVets specifically anti-endorsed or spent money to attack the candidate. If VoteVets simply did not weigh in on the race, we left the cell blank.
No Labels Support? “Yes” if a No Labels-affiliated group (Citizens for a Strong America Inc., Forward Not Back, Govern or Go Home, United for Progress Inc. or United Together) spent money in support of the candidate in the primary. “No” if the candidate is running against a candidate supported by a No Labels-affiliated group or if a No Labels-affiliated group specifically anti-endorsed or spent money to attack the candidate. If No Labels simply did not weigh in on the race, we left the cell blank.
Election month The month of the election

Exploring The Dataset

To narrow down the features we want to focus on, we first explore how several candidate attributes relate to the primary outcome, as a complement to building our statistical model.

LGBTQ+

29% of LGBTQ candidates advanced in the primaries, and those who advanced make up a little less than 2% of all candidates who ran. Success for LGBTQ candidates in the Democratic Party may be a growing trend compared with previous years, but LGBTQ status does not appear to be the best factor for predicting a successful candidate.

Race

First, note that 32.7% of all candidates advanced. If race were unrelated to the likelihood that a candidate advances, then roughly the same share of each race group (white, nonwhite, and unknown) would advance. As for the overall makeup of the candidates, the pie chart shows that 55.7% are white, 19.2% are nonwhite, and 25% are unknown.

In the stacked bar chart that groups by Primary Status and shows the racial makeup of each bar, 59.2% of the advanced candidates were white, 23.8% were nonwhite, and 17.0% were unknown. Since these proportions are not equal to the overall makeup, race could be an influential factor in the primary status of a candidate. At the same time, these proportions are not far from the overall makeup, so race may be a less decisive factor when predicting whether a candidate advanced. To see this in a different way, the stacked bar chart that groups first by race and stacks by primary status shows that the percentage of white candidates who advanced is higher than that of any other race category.
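
The two stacked bar charts answer different questions: one shows the makeup of each outcome group by race, the other the share of each race group that advanced. A minimal sketch of the distinction using `pandas.crosstab` on made-up toy data (twelve invented candidates, not the real file):

```python
import pandas as pd

# Toy data for illustration only; these are not the real candidates.
toy = pd.DataFrame({
    "Race": ["White"] * 6 + ["Nonwhite"] * 3 + ["Unknown"] * 3,
    "Primary.Status": ["Advanced", "Advanced", "Advanced", "Lost", "Lost", "Lost",
                       "Advanced", "Lost", "Lost",
                       "Lost", "Lost", "Lost"],
})

# View 1: within each outcome, the makeup by race (each column sums to 1).
makeup = pd.crosstab(toy["Race"], toy["Primary.Status"], normalize="columns")

# View 2: within each race group, the share that advanced (each row sums to 1).
rate = pd.crosstab(toy["Race"], toy["Primary.Status"], normalize="index")

print(makeup)
print(rate)
```

The first view is influenced by how many candidates of each group ran, while the second compares groups on an equal footing, which is why the second chart is the more direct check on whether race relates to advancing.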

Office Type

The Democratic candidates include 100 candidates for governor, 687 for the House, and 24 for the Senate. Within each office type, 33% of Senate candidates and 34% of House candidates advanced in the primary, while only 22% of gubernatorial candidates advanced, potentially indicating a lower chance of winning the primary for candidates running for governor.

Machine Learning

1. Introduction to K-Nearest-Neighbors (KNN)

K-nearest-neighbors (KNN) is a robust and versatile method that can be used for both classification and regression problems. To use KNN, we need a labelled dataset of training observations (x, y) and want to capture the relationship between x and y. Here x denotes the features (the attributes we use to predict) and y denotes the outcome we are trying to predict.

In the classification setting, the K-nearest neighbor algorithm essentially boils down to forming a majority vote among the K most similar instances to a given “unseen” observation. One of the pros of using a KNN model is that it can learn complex decision boundaries on the fly. It is worth noting that the minimal training phase of KNN comes at both a memory cost, since we must store a potentially huge dataset, and a computational cost at test time, since classifying a given observation requires a pass over the whole dataset.
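
The majority vote can be written in a few lines of plain Python. The sketch below is for illustration only: `knn_predict` is a hypothetical helper, the six training points are invented, and it uses Euclidean distance, which is an assumption rather than something fixed by the method.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training step: every stored point is compared with the query,
    # which is the memory and test-time cost described above.
    distances = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented two-feature training set with two well-separated groups.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["Lost", "Lost", "Lost", "Advanced", "Advanced", "Advanced"]

near_first_group = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
near_second_group = knn_predict(train_x, train_y, (5.5, 5.5), k=3)
print(near_first_group, near_second_group)
```

Choosing an odd K avoids ties when there are only two classes.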

Apply KNN to Our Dataset

We ended up with 811 candidates, which were used to create our model. We split this data into 70% training and 30% test data, then ran 10-fold cross-validation using a k-nearest-neighbors algorithm to predict whether a candidate would advance in the primary. Using recursive feature elimination, we selected 10 features to predict the primary outcome: ‘Yes.Endorsements’, ‘No.Endorsements’, ‘Office.Type_Governor’, ‘STEM._Yes’, ‘STEM._No’, ‘LGBTQ._Yes’, ‘Party.Support._No’, ‘Party.Support._Yes’, ‘Self.Funder._Yes’, ‘Elected.Official._Yes’. Using grid search, we found the best parameters for our model: n_neighbors of 24 and uniform weights. After using the pipeline to apply KNN to our dataset, we were able to accurately predict 67.8% of the primary outcomes.
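
A workflow of this shape can be sketched with scikit-learn as follows. This is a hedged illustration rather than our exact pipeline: the data is synthetic (`make_classification` stands in for the 811 candidates and 10 dummy-coded features), the feature-elimination step is omitted, and the parameter grid and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 811 rows and 10 features, not the real candidate data.
X, y = make_classification(n_samples=811, n_features=10, random_state=0)

# 70% training / 30% test split, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Hypothetical search grid; the ranges actually searched are not stated in the text.
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
}

# 10-fold cross-validation on the training data for every parameter combination.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the best model on the held-out test data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Because the grid search only ever sees the training split, the test accuracy remains an honest estimate of how the tuned model performs on unseen candidates.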

2. Introduction to Random-Forest-Classifier (RFC)

Random forests are a supervised learning algorithm that can be used for both classification and regression. They are also flexible and easy to use. A forest is composed of trees, and in general the more trees it has, the more robust the forest is. A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final answer by voting. It also provides a reasonably good indicator of feature importance.

Apply RFC to Our Dataset

We used the same 10 features as in the KNN model because we wanted to compare the accuracy scores of the two models. Before doing any cross-validation or grid search, we were able to accurately predict 73% of the outcomes. We then ran 10-fold cross-validation, and the mean score was 74%. Using grid search, we found the best parameters for the RFC: criterion of gini, max depth of 5, max features of auto, and 500 estimators. After applying the trained model to our dataset, we were able to accurately predict 75.6% of the outcomes.
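
A random forest with the parameters reported above might be set up in scikit-learn along these lines. This is a sketch under assumptions: the data is synthetic, the seeds are arbitrary, and `max_features="sqrt"` is used in place of "auto" because recent scikit-learn releases no longer accept "auto" (for classifiers it meant the same thing as "sqrt").

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: 811 rows and 10 features, not the real candidate data.
X, y = make_classification(n_samples=811, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=500,     # number of trees
    criterion="gini",     # split quality measure
    max_depth=5,          # shallow trees limit overfitting
    max_features="sqrt",  # stands in for the "auto" setting named in the text
    random_state=0,
)

# 10-fold cross-validated accuracy on the training data.
cv_scores = cross_val_score(forest, X_train, y_train, cv=10)

# Fit on all training data, then score on the held-out test data.
forest.fit(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)

# One importance value per feature; the values sum to 1.
importances = forest.feature_importances_
print(round(cv_scores.mean(), 3), round(test_accuracy, 3))
```

Sorting `importances` alongside the feature names is one way to see which candidate attributes the forest relies on most.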

Comparison and Visualization

The following two graphs compare the number of accurate and inaccurate predictions made by each machine learning algorithm against the actual outcomes. Both algorithms show a consistent pattern of predicting the candidates who lost the primary, with the Random Forest Classifier at 70% and K-Nearest Neighbors at 54%.

This bar graph depicts the overall accuracy scores received by the algorithms in comparison to the test outcomes.

Statistical Modeling

Confusion matrix:

A confusion matrix shows the intersection of the “true outcomes” (rows) vs. the “predicted” outcomes (columns); the overlap gives the counts of true/false negatives/positives.
\[\left[\begin{array}{rr} 157 & 13 \\ 50 & 24 \end{array}\right]\]

Sensitivity and Specificity:
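
Sensitivity and specificity can be computed directly from the confusion matrix above. The short sketch below assumes that both rows and columns are ordered (Lost, Advanced) and that “Advanced” is the positive class; the text does not state this ordering, so treat it as an assumption.

```python
# Rows are true outcomes, columns are predicted outcomes.
# Assumed ordering for both: (Lost, Advanced), with Advanced as the positive class.
confusion = [[157, 13],
             [50, 24]]

tn, fp = confusion[0]  # true Lost: predicted Lost, predicted Advanced
fn, tp = confusion[1]  # true Advanced: predicted Lost, predicted Advanced
total = tn + fp + fn + tp

sensitivity = tp / (tp + fn)    # 24 / 74, share of true Advanced that were caught
specificity = tn / (tn + fp)    # 157 / 170, share of true Lost that were caught
accuracy = (tp + tn) / total    # 181 / 244

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

Under this reading, the model identifies candidates who lost far more reliably than candidates who advanced (specificity of about 0.92 against sensitivity of about 0.32), even though overall accuracy is about 0.74.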

Limitations

Our dataset lacked certain variables that we thought would be useful features in our analysis. Gender would let us track how likely men and women are to win or lose their primaries. The source of our dataset mentioned gender in its content description, but the column did not appear in the CSV. Age is another feature that could help show whether diversity is higher among younger candidates than older ones, or reflect how many people are running for re-election and their age range. This dataset covers only the 2018 Democratic primaries, so we could not use previous years’ data to fill in missing values. If a column had too many missing values, we ultimately did not use that factor. Similarly, applying dummy variables to some features would create too many columns; State, for instance, would produce 50. Any feature that resulted in an overwhelming number of columns was not used in the analysis.

Conclusion

| Method | Accuracy Score |
| --- | --- |
| KNN | 0.68 |
| RFC | 0.76 |
| Logistic | 0.74 |

All of our methods (KNN, RFC, and logistic regression) had similar accuracy scores, with RFC being the best model for predicting whether or not a Democratic candidate advanced, based on the features ‘Yes.Endorsements’, ‘No.Endorsements’, ‘Office.Type_Governor’, ‘STEM._Yes’, ‘STEM._No’, ‘LGBTQ._Yes’, and ‘Party.Support._No’. Despite the limitations imposed by the features that were available, we were happy to see that we could predict with a satisfactory accuracy score of 76% based on factors such as endorsements, office type, STEM, LGBTQ, and party support. This supported our original hypothesis that some of the given factors would give us an idea of whether a Democratic candidate would advance. With the increasing attention to technology and LGBTQ rights, factors such as STEM and LGBTQ were not a surprise to us, although they might not have been as important in past elections. Overall, this model and these findings can help people who are politically active, or who want a better understanding of the political environment, see which factors may be important to Democrats when voting for candidates.